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Abstract 


It has been observed in a variety of contexts that gradient descent methods have great success 
in solving low-rank matrix factorization problems, despite the relevant problem formulation being 
non-convex. We tackle a particular instance of this scenario, where we seek the d-dimensional 
subspace spanned by a streaming data matrix. We apply the natural first order incremental gradient 
descent method, constraining the gradient method to the Grassmannian. In this paper, we propose 
an adaptive step size scheme that is greedy for the noiseless case, that maximizes the improvement 
of our metric of convergence at each data index t, and yields an expected improvement for the noisy 
case. We show that, with noise-free data, this method converges from any random initialization to 
the global minimum of the problem. For noisy data, we provide the expected convergence rate of 
the proposed algorithm per iteration. 

1. Introduction 

Low-rank matrix factorization is one of the foundational tools of signal processing, numerical meth¬ 
ods, and data analysis. Suppose we wish to factorize a matrix M = UW"^, imposing orthogonality 
constraints on U or W. Solving for such matrix factorizations can be computationally burdensome, 
and many algorithms that attempt to speed up computation are actually solving a non-convex opti¬ 
mization problem, therefore coming with few guarantees. 

The Singular Value Decomposition (SVD) is the solution to a non-convex optimization 
problem, and there are several highly successful algorithms for solving it Golub and Van Loan 
(2012). Unfortunately, these algorithms cannot easily be extended to problems with regulariz- 
ers or missing data. Recently, several results have been published with first-of-their-kind guar¬ 
antees for a variety of different gradient-type algorithms on non-convex matrix factorization prob¬ 
lems Jain et al. (2013); De Sa et al. (2014); Armentano et al. (2014); Chen and Wainwright (2015); 
Bhojanapalli et al. (2015); Zheng and Lafferty (2015). These new algorithms, being gradient-based, 
are well-suited to extensions of the original problem that include different cost functions or regular- 
izers. For example, with gradient methods to solve the SVD we may be able to solve Robust PCA 
Candes et al. (2011); He et al. (2012); Xu et al. (2010), Sparse PCA d’Aspremont et al. (2008), or 
even PCA Brooks et al. (2013) with gradient methods as well. 

Our contribution is to provide a global convergence result for d-dimensional subspace esti¬ 
mation using an incremental gradient algorithm performed on the Grassmannian, the space of all 
d-dimensional subspaces of W^. Subspace estimation is a special case of matrix factorization with 
orthogonality constraints, where we seek to estimate only the subspace spanned by the columns of 
the left matrix factor U € Our result demonstrates that this gradient algorithm converges 
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globally almost surely, i.e., it converges from any random initialization to the global minimizer. To 
the best of our knowledge, this is the first global convergence result for an incremental gradient de¬ 
scent method on the Grassmannian. When there is no noise, we propose a greedy step size scheme 
that maximizes the improvements on the defined metrics of convergence. Given this, we provide a 
rate of convergence in two parts: slower convergence in an initial phase starting from the random 
initialization, and then linear convergence for a local region around the global minimizer, where 
our results match those in Balzano and Wright (2014). For the noisy case, we propose a step-size 
regimen that is simply a weighted version of the step size for noise-free data, where the weights de¬ 
pend on the data and noise statistics. With this step size, we provide results guaranteeing monotonic 
improvements on the metrics of convergence in terms of expectation. 

Incremental gradient descent is our focus, motivated by streaming data applications. There are 
many applications of subspace estimation and tracking in medical imaging, communications, and 
environmental science; see more in Edelman et al. (1998); Balzano and Wright (2014); Balzano 
(2012). Matrix factors with orthogonality constraints, such as those given by the SVD, are also used 
in several data applications: they provide a unique collection of low-dimensional projections for data 
visualization, capture directions of maximal variance so as to give useful insights into data structure, 
and allow compressed storage of massive datasets with a precise notion of loss in compression. 

2. Formulation and Related Work 

We may formulate subspace estimation as a non-convex optimization problem as follows. Let M € 
^nxN ^ matrix that we wish to approximate with a subspace of rank d, and solve: 

minimize ||C/VF^ - M||| (1) 

subject to span ((/) € G{n, d) 

This problem is non-convex firstly because of the product of the two optimization variables U and 
W and secondly because the optimization is over the Grassmannian Q{n,d), the non-convex set of 
all d-dimensional subspaces in M"'. However, several methods^ can find fhe global minimizer of fhis 
problem in polynomial time under a variety of assumptions on M. 

In this paper, we are interested in approximating a streaming data matrix. At each step, we 
sample a column of M, denoted xj G M”. We consider the planted problem, where xt = vt + 
where is noise and vt is drawn from a continuous distribution with support on the true subspace, 
spanned by (7 G with orthonormal columns; vt = Ust, st G M'^. When = 0, we wish to 
find fhe U fhaf minimizes 

OO 

F{U) = y min WUwt-xtWh (2) 

t=l 

i.e., fhe span of fhe dafa vectors or fhe range of U, denofed R{U). When 0 we sfill discuss 
resulfs in ferms of fhe disfance from tJ. If we consider only t = 1,... ,N, Problem (2) is identical 
to Problem (1). The GROUSE algorifhm (Grassmannian Rank-One Updafe Subspace Esfimafion) 

1. For example, the power method can solve this problem if the top d singular values of M are dis¬ 
tinct Golub and Van Loan (2012). Specifically, considering d = 1, if the desired accuracy of the U output by the 
power method to the global minimizer is e*, and the first two singular values of M, and a 2 {M) are distinct 

with the ai{M) = ca 2 {M) for c > 1, then the power method converges in O f ^ ) iterations. 
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we analyze is shown as Algorithm 1, where we generate a sequence {Ut}t=o,i,... ofnxd matrices 
with orthonormal columns with the goal that R{Ut) —>■ R{U) as i oo. Each observed vector is 
used to update Ut to and we constrain the gradient descent method to the Grassmannian using 
a geodesic update Edelman et al. (1998). 

Because of the importance of the problem, it has been studied for decades, and there is a great 
deal of related work. We direct the reader to Edelman et al. (1998); Balzano (2012) for in-depth 
descriptions of algorithms and guarantees. We focus here on recent results that have global con¬ 
vergence guarantees to the global minimizer and study either gradient-type algorithms, algorithms 
that handle streaming data, or algorithms that maintain orthogonality constraints with manifold op¬ 
timization. 

Eirst we discuss incremental methods. De Sa et al. (2014) established the global convergence 
of a stochastic gradient descent method for the recovery of a positive definite matrix M in the un¬ 
dersampled case, where the matrix M is not measured directly but instead via linear measurements. 
They propose a step size scheme under which they prove global convergence results from a ran¬ 
domly generated initialization. Similarly, Balsubramani et al. (2013) invokes a martingale-based 
argument to show the global convergence rate of the proposed incremental PCA method to the sin¬ 
gle top eigenvector in the fully sampled case. In contrast, Arora et al. (2013) estimates the best 
d-dimensional subspace in the fully sampled case and provides a global convergence result by re¬ 
laxing the non-convex problem to a convex one. We seek to identify the d dimensional subspace by 
solving the non-convex problem directly. Einally, our work is most related to Balzano and Wright 
(2014), which provides local convergence guarantees for GROUSE in both the fully sampled and 
undersampled case. Our work focuses on global convergence but only in the fully sampled case; we 
will extend the global convergence results to the undersampled case in future work. 

Turning to batch methods, R.H.Keshavan (2012); Jain et al. (2013) provided the first theoretical 
guarantee for an alternating minimization algorithm for low-rank matrix recovery in the undersam¬ 
pled case. Under typical assumptions required for the matrix recovery problems Recht et al. (2010), 
they established geometric convergence to the global optimal solution. Earlier work Keshavan et al. 
(2010); Ngo and Saad (2012) considered the same undersampled problem formulation and estab¬ 
lished convergence guarantees for a steepest descent method (and a preconditioned version) on the 
full gradient, performed on the Grassmannian. Chen and Wainwright (2015); Bhojanapalli et al. 
(2015); Zheng andEafferty (2015) considered low rank semidefinite matrix estimation problems, 
where they reparamterized the underlying matrix as M = UU'^, and update U via a first order 
gradient descent method. However, all these results require batch processing and a decent initial¬ 
ization that is close enough to the optimal point, resulting in a heavy computational burden and 
precluding problems with streaming data. We study random initialization, and our algorithm has 
fast, computationally efficient updates that can be performed in an online context. 

Eastly, several convergence results for optimization on general Riemannian manifolds, includ¬ 
ing several special cases for the Grassmannian, can be found in Absil et al. (2009). Most of the 
results are very general; they include global convergence rates to local optima for steepest descent, 
conjugate gradient, and trust region methods, to name a few. We instead focus on solving the prob¬ 
lem in (2) and provide global convergence rates to the global minimum. 
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3. Convergence analysis 

We analyze Algorithm 1. At each step, the algorithm receives a vector xt = vt + it ^ such 
that vt = USt, St G and it is zero mean Gaussian noise. The algorithm then outputs an n x d 
matrix Ut with orthonormal columns at each iteration. We wish to recover U, i.e., the minimizer 
of Equation (2) when there is no noise. We would like to emphasize that in this scenario in a 
real application one would use the ISVD or a Gram-Schmidt procedure, but we seek convergence 
results for the Grassmannian gradient descent algorithm so that extensions can be made; e.g., we 
may regularize the cost function or we may minimize some other function of the data. Reliable 
global convergence of the GROUSE algorithm has been observed empirically, despite the fact that 
the algorithm is solving a non-convex problem and operating on a non-convex manifold. 

Algorithm 1 takes each vector xt, forms the gradient of min^ HC/m—xt |||, and takes a step in the 
direction of the negative gradient. The step is taken along the Grassmannian, the manifold of all d- 
dimensional subspaces of and according to the step size described and justified below. In words, 
the algorithm works as follows: Eirst we project our data vector onto the current subspace iterate 
to get the projection pt. Then we calculate the residual r*. The update to our subspace estimate Ut 
then requires only the addition of a rank-one matrix, as can be seen in Equation (4). This update is 
derived and explained in further detail in Balzano et al. (2010); Edelman et al. (1998). The rank-one 
update tilts Ut to no longer contain pt but instead contain a linear combination of pt and rp, in other 
words, it moves Ut towards the observation vt. 


Algorithm 1 GROUSE: Grassmannian Rank-One Update Subspace Estimation 
Given C/q, an n x d matrix with orthonormal columns, with 0 < d < n; 

Set t := 0; 

repeat 

Given observation xt = vt + it for vt G R{U)\ 

Define wt := argmin^ WUtw — xjHI; 

Define pt := Utvop, n := xt - Utwp, 

Using step size 

\\pt\\J ’ 


9t = arctan ( (1 — at) 


(3) 


where at = (l — p) where c > 0 and cr^ denotes the upper bound for the noise 

level (Condition 1), update with a gradient step on the Grassmannian: 


where 


t:=t + l- 

until termination 


Ut+i 



Pt \ wj 

WptWj IktII 


yt 


\\yt 


cos{6t)-r^ + sm{6t) 
\\Pt\\ 


n 


(4) 


Before we present our main results on the convergence of the GROUSE algorithm, we first call 
out the following definitions and condition that will be used throughout our analysis. 
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Definition 1 (Principal Angles) 'We use (t)i {U, Ut) ,i = 1,... ,d to denote the principal angles 
between subspaces R{Ut) and R{U), which are defined [Stewart and Sun (1990), Chapter 5] by 
coscj)iiU,Ut) = ai{U^Ut). 

Definition 2 (Determinant similarity) Our first metric is Q ^ [0,1], which measures the similarity 
between two subspaces and is defined as 


d 

Ct := detiU^UtU ^^) = 11 ■ (5) 

i=l 


Definition 3 (Frobenius norm discrepancy) Our second metric is et G [0, d], which measures the 
discrepancy between R{Ut) and R{U), and is defined as 

d 

et := sin2 fifiU, Ut) = d - \\U^Ut\\l . (6) 

i=l 


Condition 1 The inputs of GROUSE are xt = vt + ^t where vt = Ust with = 0, Cov{st) = I* 
and ^t R Gaussian random vector with entries being independently normal random variables such 
that E Further, we assume the energy of the underlying signals are finite, 

i.e., WvtW^ < oo. 


3.1. Optimal Adaptive Step Size 


In this section, we first derive a greedy step size scheme for each iteration t that maximizes the 
improvement on the defined mefrics (et, Ct) of convergence for fhe noiseless case, i.e., xt = vt. Lef 
II and vt^x denofe fhe projection and residual of vt onfo R{Ut). Then after each update we have 
the following (Appendix C): 


Ct+i 

Ct 


G - G+l 


\\u^yt\\ 


(7a) 

(7b) 


with -JiK = 11 ^*’^, cos 9t + II sin 6t. It follows that 


61 = arg max = arctan 

e Ct 


\\vt,. 




This is equivalent to (3) for the noise-free case setting at = 0. Using 6[, we obtain monotonic 


improvement on the determinant increment = 1 + 


discrepancy, we obtain et+i — = 1 — 


llw,iilh 




I tit 


vr- ^ 


> 1. For the Frobenius norm 


-; that is, et also achieves its maximal improvement. 


Therefore, when there is no noise in the observations, the proposed step size scheme described by 
(3) implies greedy learning rates with respect to the defined mefrics (et, Ct) of convergence. 
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For the noisy case, we propose a weighted step size schedule by restricting at € (0,1] with 
the goal that ^ 1 as our estimated subspace R{Ut) gradually converges to the true subspace 
R{U). The intuition behind this strategy is that, choosing the step size in Equation (3), the update 
of GROUSE follows as 



for which we have, if the noise is Gaussian distributed, lluf ~ lbt,±|P+ (l-d/n)||^i|p (whereby 
a ~ 6 we mean a concentrates around b), hence the noise part will gradually dominate the projection 
residual as R{Ut) R{U). It is therefore natural for us to consider incorporating less and less of 
the residual information into R{Ut) over time. Therefore, we propose the following schedule for a: 



( 8 ) 


where c > 0. As we will show in Section 4, with this weighted learning rate scheme, we obtain 
improvements in expectation on both Q and et- 

3.2. Convergence Without Noise 

In this section, we consider the noise-free case, that is xt = vt and vt € R{U). The step size (Eq 
(3)) used in this section has = 0 for all iterations. We provide analysis of the algorithm in two 
separate phases. In the first phase the GROUSE algorithm will converge to a local region of the 
global optimal point from a random initialization within 0{d?log{n)) iterations. Erom there, in the 
second phase GROUSE converges linearly to the optimal point. In each phase we use a different 
metric of convergence, which helps us obtain an overall faster convergence rate as compared to other 
work. The convergence rate with respect to only either determinant De Sa et al. (2014) or Erobenius 
norm discrepancy Jain et al. (2013) is either much slower within the local region De Sa et al. (2014) 
or slower in an initial phase from random initialization Jain et al. (2013). This is demonstrated 
numerically in Eigure 1 . 

Theorem 4 (Global Convergence of GROUSE) Suppose Condition 1 and that no noise is con¬ 
tained in the observations, i.e., xt = vt- Let e* > 0 be the desired accuracy of our estimated 
subspace using the metric in Definition 3. Initialize the starting point Uq of GROUSE as the or¬ 
thonormalization of an n X d matrix with entries being standard normal variables. Then for any 
p, p' > 0, after 


K>Ki + K2 



(9) 


iterations of GROUSE (Algorithm 1), 


r{eK<e*)>l-p' -p. 


(10) 


where pQ 


log '’ +d\og{e/d) 

' r} lr»cr n 


d\ogn 


with C > 0 a constant approximately equal to 1. 
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The proof of Theorem 4 is a direct combination of our analysis in two phases of the algorithm, 
stated in Theorem 5 and Theorem 6 below. 

Theorem 5 (Initial convergence of the determinant similarity Ct to Under the conditions of 
Theorem 4, for any p' € (0,1), after 

Ki > /iolog(n) 

iterations of GROUSE (Algorithm 1), 

>l-p' 

where pQ is the same as that in Theorem 4. 

Analyzing the determinant similarity turns out to be the key to proving convergence in this initial 
phase of GROUSE. The determinant similarity increases quickly toward 1 in the first phase. This 
also gives insight into how the GROUSE algorithm manages to seek the global minimum of a non- 
convex problem formulation: GROUSE is not attracted to stationary points that are not the global 
minimum. Eor our problem, all other stationary points Ustat have det{U^UstatUj^at^) = 0> be¬ 
cause they have at least one direction orthogonal to U Balzano (2012). If the initial point C/q has 
determinant similarity with U strictly greater than zero, and GROUSE increases the determinant 
similarity monotonically (as we mentioned in Section 3.1 and prove in Section 4), then we are guar¬ 
anteed to stay away from other stationary points. Since we initialize GROUSE using Uq uniformly 
from the Grassmannian, as the orthonormal basis of a random matrix V € with entries being 
independent standard Gaussian random variables, we guarantee Co > 0 with probability one. 

Theorem 6 (Local convergence of the Frobenius norm discrepancy et to 0) Suppose at itera¬ 
tion k we have Cfc > 1/2. Then for any p G (0,1) and given accuracy e*, after 

K 2 > 2d log 

additional iterations of GROUSE Algorithm 1, we have 

IF’(cfe-i-A'2 < e*) > 1 - P • 


In the first phase, we require O \og{n)/p'^ iterations to reach the local region of the global 
minimum, where 1 — p' is the probability with which wc’ll reach the local region. In simulations 
(Section 5, Eigure 2) with isotropic Gaussian data vectors from the subspace, we actually see that 
0{df\og{n)) iterations are many more than enough to reach the local region, without fail. Our 
analysis, though, only requires zero-mean uncorrelated identically distributed random data vectors. 
Bounds on higher moments may admit a tighter analysis, which wc leave for future work. 

The second phase only requires 0(dlog(l/e*p)) iterations to converge to e* accuracy in the 
Erobenius norm discrepancy metric given in Definition 3. This result is true to what we see in 
practice, as you can sec in Eigure 2. The analysis behind this result provides a tighter version of 
[Balzano and Wright (2014), Theorem 3.2] that both grows the local region of convergence and 
(slightly) improves the rate to be less dependent on the current value of et- 
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3.3. Convergence With Noise 

In this section, we study the convergence behavior of GROUSE with noise in each observation. Un¬ 
like the noise-free case, here we only provide expected monotonic improvements of our convergence 
metrics. As we prove in the appendix, the results we present here also imply the corresponding ones 
for the noiseless data. 


Theorem 7 (Expected convergence rate of the determinant similarity Q) Given Condition 1 is 
satisfied, after one iteration of GROUSE we have the following improvement of the determinant 
similarity in expectation: 


E 


Ci-l-l 


Ut 


> 1 + /3o 


1-Ct 


1 - 


a 


i-Ct 


-h cr 



where fio = 

This theorem implies that the expected convergence rate of determinant similarity is damped by 
the presence of noise. To be more specific, rewrite the expected improvement as E [Ct+i | Ut\ > 

^1 -|- ^ We can see that, comparing with the noiseless case, for small SNR 

(large cr^), the expected increment on Qt is approximately scaled by Hence the theoretical 

bound on the iterations necessary to achieve given accuracy C* in the small SNR case should roughly 
be at least d times that required by the noiseless case. For large SNR (small u^), the expected 
convergence rate is close to that of the noise-free case, as long as (^t is not too close to 1. Therefore, 
the required iterations to arrive at the local region of the true subspace should be close to that in the 
noiseless case. We show the corresponding numerical illustrations in Figure 1 and Figure 3. 


Theorem 8 (Expected convergence rate of the Frobenius norm discrepancy eft Under Condi¬ 
tion 1, we obtain the following upper bound on the decrease of Frobenius norm discrepancy et 
in expectation: 

E [£,+.|t/,] < (l - 

where fio = A = 1 — f, and the largest principal angle between R{Ut) and R{U). 

As indicated by Theorem 7, the expected convergence rate will slow down as Q increases. However, 
the above theorem implies that for large SNR (small a^), once we enter the local region of the true 
subspace, the convergence rate of the Frobenius discrepancy will take over. Specifically, when 
cos"^ (pt,d > 1/2, the convergence rate of et can be bounded from below by 1 — ^ as 

long as et > dfa'^. Therefore, an implication of Theorem 8 is that GROUSE will converge to 
a ball centered on the true subspace whose radius is determined by the noise level and subspace 
dimension. The convergence rate will slow as GROUSE approaches this ball. On the other hand, 
since 1 — e* < cos^ 4>t,d < 1 — et/d, by a simple calculation we can see that for small SNR (large 
(T^), the fast local convergence never kicks in. In that case, we only study the convergence behavior 
of GROUSE in terms of the determinant similarity Q- 

As we mentioned previously, with noise the improvement is not monotonic for either deter¬ 
minant similarity (Ct) or Frobenius norm discrepancy {et). This is a hurdle to pass before we can 
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provide similar global convergence results as we obtained for the noise-free case (Theorem 4). 
However, by leveraging techniques in stochastic process theory, it might be possible to establish 
asymptotic convergence results or even non-asymptotic convergence results in terms of the number 
of iterations required before GROUSE first achieves a given accuracy. We leave this as future work. 


4. Supporting Theory 


We first call out the following lemma to quantify the expectation of the determinant similarity be¬ 
tween our random initialization and the true subspace. For convenience, we will drop the subscript 
of all terms except et, Ct and Ut hereafter. 

Lemma 9 Nguyen et al. (2014) Initialize the starting point Uq of GROUSE as the orthonormaliza¬ 
tion of an n X d matrix with entries being standard normal variables. Then 

E[Co] = E [det(C/o^C7c7^[/o)] = C 
where C > 0 is a constant approximately equal to 1. 


As we mentioned in Section 3.1, both the determinant similarity C,t and the Frobenius discrepancy 
et improve monotonically in the noiseless case. We formally present this in the following lemma. 

Lemma 10 (Monotonic results for the noiseless case) When there is no noise, given the step size 
in Eq (3), after one update of GROUSE we obtain 

C.+1 , , l|fxll= , , 

Ihiip 

where ny and v± denote the projection and residual ofv onto R{Ut). 

For the noisy case, we provide the following lemmas, which are the intermediate results that 
allow us to establish the expected improvements on both Qt and et in Section 3.3. 


Lemma 11 Given Condition 1 is satisfied, after one update of GROUSE we obtain the following 

\\rf 


E[Cm|C^t] > 1+E 


(l-«r 


IIpII 


Ut 


Ct 


Lemma 12 After one iteration of the GROUSE algorithm, we have the following 

im^pf 


E [et - et+i\Ut] = E 


1 -7^- 


Ut 


where TZ = 

Q:r|p 


According to our definition of a in Section 3.1, we can see that when R{Ut) is not close to R{U), 
1 — a is large, as is ||r|p/||p|p. Therefore, Lemma 11 implies that the expected convergence rate 
of the determinant similarity (Q) is faster in the first phase. For the Frobenius norm discrepancy 
(et), comparing to the noiseless case where p = Femma 12 implies that we obtain monotonic 
expected decrease in Frobenius norm discrepancy as long as we are outside a ball centered on the 
true subspace. This ball shrinks as —)■ 0, with no such constraint for = 0. As we approach 
this ball, the expected convergence rate slows. 
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Frobenius norm discrepancy (et) determinant similarity (^t) 




iteration iteration 

Figure 1: Illustration of expected convergence bounds given by Theorem 8 (left) and Theo¬ 
rem 7 (right) over 100 trials. In this simulation, n = 2000, d = 20 and cr^ G 
{0, le — 5, le — 3, le — 1,1}. The red dashed line indicates the linear convergence rate, 
while the diamonds denote the lower bound on expected convergence rate in Theorem 
7. We can see that the convergence rate of each phase slows down in the noisy case. 
However, when cj^ is small, the convergence behavior of GROUSE similar to that of the 
noise-free case. We get a faster convergence rate of in the initial phase an almost linear 
convergence of et in the local region oi R{U). 


5. Numerical Results 

With our plots we illustrate why the two analysis approaches allow us to prove rates in both phases 
of GROUSE. Eor each numerical result in this section, we initialize GROUSE with orthonormalized 
Gaussian matrices with entries iid AA(0,1). The underlying subspace of each trial is set to be a sparse 
subspace, as the range of an n x d matrix U with sparsity on the order of Wg generate the 

coefficient matrix W with entries i.i.d M{0, 1). Eor the noisy case, we then normalize the columns 
of the underlying matrix UW and add a noise matrix N, with Nij ~ M{0, a'^ In). In the noisy case, 
we run GROUSE with the step size described in Equation (8), where we set c to its expected value 
1 . 

As is demonstrated in Eigure 1, when there is no noise in the observations or the SNR is large 
enough, the determinant similarity {(t) increases quickly in the initial phase, while the Erobenius 
norm discrepancy (e*) decreases slowly. Then in a local region of the true subspace, our accurate 
bound on the fast convergence of the Erobenius norm discrepancy takes over. However, if the SNR 
is small, the convergence rate of the Erobenius norm discrepancy slows down; in this scenario we 
only study the convergence of GROUSE in terms of the determinant similarity. In Eigure 1, we 
show that the convergence rate of determinant similarity will also slow down as we increase the 
magnitude of however, the convergence rate described in Theorem 7 is still tight. This allows us 
to obtain a good enough approximation of the number of iterations required to reach a ball around 
R{U), which is captured by Ki alone in this case. 
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Figure 2: Illustration of the bounds on Ki and K 2 compared to their values in practice, averaged 
over 50 trials with different n and d. We show the ratio of Ki to the bound log(n) 
in the initial phase (left) and the ratio of K 2 to the bound dlog(l/e*) in the local phase 
(right). 


We next examine the tightness of our theoretical values of Ki and K 2 for noiseless convergence 
in Figure 2. We run GROUSE to convergence for a required accuracy e* = le — 4 and divide the 
iterations into Ki, the number to reach Q > ^, and K 2 , the remaining number to reach et < e*. 
We show the ratio of Ki to the bound d^ log(n) in the initial phase (top plot) and the ratio of K 2 
to the bound d\og{l/e*) in the local phase (bottom plot). We run 50 trials and show the mean and 
variance. We can see that the value for Ki is very loose. On the other hand, the value for K 2 is very 
accurate; 0((ilog(l/e*)) iterations are required to get to accuracy e*. 

Finally, we examine the tightness of approximated Ki and K 2 for the noisy case in Fig 3. As we 
mentioned in Section 3.3, for small SNR (large cr^), the necessary number of iterations to achieve 
the given accuracy should be roughly d times that required by the noise-free case, while for large 
SNR (small cr^), this ratio would be less. For large SNR, we first run GROUSE to reach the local 
region of the true subspace, i.e., (ki > and record iFi; from this point we run GROUSE to 
converge to e* = and then record K 2 and compare it with that required by the noise-free 
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case. For small SNR (large a^), we only numerically examine the convergence rate of the first 
phase, i.e., necessary iterations to achieve the given accuracy (ki > (l — ^ 2 ^ 0 '^) /n^ 

As we can see in Figure 3, we test Ki versus 0{d^ log(n)), and as in noiseless case the bound on 
Ki is loose. For small noise, the bound on K 2 is tight and stable. 





Figure 3: Illustration of the bounds on Ki,K 2 over d and a with fixed n, averaged over 50 fri- 
als. In this simulation, n = 5000 with d ranging from 10 to 100 and cr^ chosen 
from le-3 to 1 (above) and le-8 to le — 2 (below). We run GROUSE to converge to 
min |l/2, and record the corresponding Ki. For large SNR (small u^) we 

first run GROUSE to converge to 1/2 and record Ki, then let GROUSE further converge 
to e* = max |fT^, and record K that gives K 2 = K — Ki. Eor this plot we set 

both Ti, r 2 be log((i). 


6. Conclusion 

This paper has provided the first global convergence result for an incremental gradient descent 
method on the Grassmannian for noise-free data. Eor optimizing a particular cost function (2) in the 
noiseless case, we showed that the gradient algorithm converges from any random initialization to 
the global minimizer. Our novel analysis shows the convergence happens in two phases: the initial 
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convergence and the local convergence. In the initial phase, we provided a very loose bound on the 
number of iterations Ki = 0{d^ log(n) /p') required to get to a local region of the global minimizer 
from the random initialization with probability 1 — p'. In fact, this phase usually takes many fewer 
iterations and reaches the local region in all empirical trials. In the local phase for the noiseless 
case, we provided a very tight bound for the required iterations K 2 = 0{dlog{l/e*) to achieve a 
final desired accuracy of e*. 

When the observations contain noise, we establish a rate of expected improvement of both of 
our metrics Q and e* for all iterations t. Establishing the global convergence result remains as future 
work. 

References 

P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. 
Princeton University Press, 2009. 

Diego Armentano, Carlos Beltran, and Michael Shub. Average polynomial time for eigenvector 
computations. arXiv preprint arXiv:1410.2179, 2014. 

Raman Arora, Andy Cotter, and Nati Srebro. Stochastic optimization of PCA with capped MSG. 
\n Advances in Neural Information Processing Systems, pages 1815-1823, 2013. 

Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental 
PCA. In Advances in Neural Information Processing Systems, pages 3174—3182, 2013. 

Laura Balzano and Stephen J Wright. Local convergence of an algorithm for subspace identification 
from partial data. Foundations of Computational Mathematics, pages 1-36, 2014. 

Laura Balzano, Robert Nowak, and Benjamin Recht. Online identification and tracking of subspaces 
from highly incomplete information. In Communication, Control, and Computing (Allerton), 
2010 48th Annual Allerton Conference on, pages 704-711. IEEE, 2010. 

Laura Kathryn Balzano. Handling missing data in high-dimensional subspace modeling. PhD 
thesis. University of Wisconsin - Madison, 2012. 

Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster 
semi-definite optimization. arXiv preprint arXiv:1509.03917, 2015. 

J Paul Brooks, JH Dula, and Edward L Boone. A pure 11-norm principal component analysis. 
Computational statistics & data analysis, 61:83-98, 2013. 

Emmanuel J Candes, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? 
Journal of the ACM (JACM), 58(3): 11, 2011. 

Yudong Chen and Martin J Wainwright. Fast low-rank estimation by projected gradient descent: 
General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015. 

Alexandre d’Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse prin¬ 
cipal component analysis. The Journal of Machine Learning Research, 9:1269-1294, 2008. 


13 



Zhang Balzano 


Christopher De Sa, Kunle Olukotun, and Christopher Re. Global eonvergenee of stoehastie gradient 
deseent for some noneonvex matrix problems. arXiv preprint arXiv:1411.1134, 2014. 

Alan Edelman, Tomas A Arias, and Steven T Smith. The geometry of algorithms with orthogonality 
eonstraints. SIAM journal on Matrix Analysis and Applications, 20(2): 303-353, 1998. 

Gene H Golub and Charles F Van Loan. Matrix computations. JHU Press, 4 edition, 2012. 

Jun He, Laura Balzano, and Arthur Szlam. Ineremental gradient on the grassmannian for online 
foreground and baekground separation in subsampled video. In IEEE CVPR, June 2012. 

Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix eompletion using alter¬ 
nating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of 
computing, pages 665-674. ACM, 2013. 

Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix eompletion from a few 
entries. Information Theory, IEEE Transactions on, 56(6):2980-2998, 2010. 

Thanh Ngo and Yousef Saad. Sealed gradients on grassmann manifolds for matrix eompletion. In 
Advances in Neural Information Processing Systems, pages 1412-1420, 2012. 

Hoi H Nguyen, Van Vu, et al. Random matriees: Law of the determinant. The Annals of Probability, 
42(1):146-167, 2014. 

Benjamin Reeht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear 
matrix equations via nuelear norm minimization. SIAM review, 52(3):471-501, 2010. 

R.H.Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 

2012 . 

Peter Riehtarik and Martin Takac. Iteration eomplexity of randomized bloek-eoordinate deseent 
methods for minimizing a eomposite funetion. Mathematical Programming, 144(1-2): 1-38, 
2014. 

Gilbert W Stewart and Ji-guang Sun. Matrix perturbation theory. Aeademie press, 1990. 

Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. \n Advances 
in Neural Information Processing Systems, pages 2496-2504, 2010. 

Qinqing Zheng and John Lafferty. A eonvergent gradient deseent algorithm for rank minimiza¬ 
tion and semidefinite programming from random linear measurements. In Advances in Neural 
Information Processing Systems, pages 109-117, 2015. 


14 



Global Convergence oe GROUSE 


Appendix 

We first call out the following definitions that will be used frequently throughout the whole proof. 
For convenience, we will drop the subscript of all terms except et and and Q hereafter. 


Definition 13 Let denote the projection and residual of v onto R{U), i.e., uy = UU'^v and 

v± = V — = {I — UU^)v, and similarly let = UU'^^ and = C “ '^|| = ~ It 

follows that, 


P = v\\+ ,^11 and r = u_L + ^_L 


Define wi and W 2 as 


rp rp 

Wi = U V = U Uy 


and W2 = = U'^C\\ 


Appendix A. Preliminaries 

We start by providing the following lemma that we will use regularly in the manipulation of the 
matrix V^U, which is relevant for both our metrics of discrepancy and similarity between the 
subspaces. The proof can be found in Stewart and Sun (1990). 


Lemma 14 (Stewart and Sun (1990), Theorem 5.2) There are unitary matrices Q, Y, and Y such 


that 


QtJY 


d 

d n\ 

d 0 = Ai, 

n — 2d\oJ 


d 

d /r\ 

QUY := d S = As, 

n — 2d\0 / 


where T = diag (cos ..., cos S = diag {sinft,!, ■■■ with ft,i being the 

principal angle between R{U) and R{U) defined in Definition 1. 


In words, there are unitary matrices Q, Y, and Y such that U = Q^AiY^ and U = Q^AsL^. 
Letting B = U'^UU'^U, we call out the following simplified quantities for future reference : 


B = YT^Y'^ and B'^B = YT^Y'^ . 


( 11 ) 


Next we present the following two lemmas that are for us to relate the projection (uy) and 
residual (u_l) to both of our metrics et and Q- The proofs can be found in Appendix D. 


Lemma 15 (Balzano and Wright (2014), Lemma 2.12) Given any matrix Q € suppose 

that X G is a random vector with entries identically distributed, zero-mean, and uncorrelated, 
then 


E 


x"^Qx 

x'^x 


i<r(Q) 
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Lemma 16 Let X = [Xi, • • • , X^] with Xj G [0,1], i = 1,..., d, then 

d 

> 1-ntiX, 

i=l 

Given Hf^^Xi > we have 


d 

^=1 


Now we are ready to present the following lemma that shows a relationship relating the projec¬ 
tion (r;||) and residual (u_l) to our metrics et and Q in terms of expectation. This is a central result 
used to obtain the expected convergence rates of both e* and Q. 


Lemma 17 Under Definition 13, we have the following 


E 


U 



[ ||n||2 


d 

E 


u 

>1-0 


[ ||u||2 


- d 


E 


|(/-i 7 i 7>ii|| 


u 


cos'^fid 
- d 


(12a) 

(12b) 

(12c) 


where fid the largest principal angle between our estimated subspace R(Ut) and the true subspace 
R{U) (Definition 1). 


Proof According to Lemma 15, we have 

■||i7s||2 - wuu^usf 


E 


u 

= E 


1^2 






u 


^ E 


s^'Y{I - T‘^)Y'^s 


s'^s 


U 




d 

= u/d 

where di follows by ||f/s|p = ||s|p and 1 I 2 follows from Lemma 15. 
(12b) is a direct result of Lemma 16 by setting X = diag(r2), i.e., 

|2 


E 


|n±| 


U 




Finally for (12c), we have 

E 


■||(/-C7C7>I|||' 

TT 

— F 

's'^YT^il -T^)Y'^s 

TT 

1^2 



s’^s 



= itr(r2(/-r2)) 


> cos'^fidtr (/- r^) • ^ 
cos^ fid 
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Appendix B. Proof of Main Results 
B.l. Noiseless Case 


In this section, given the results presented in Section 4 and Appendix A, we provide the proofs of 
our main results (Theorem 5, 6). 

Proof of Theorem 5 Proof According to Lemma 10, Ct is an non-decreasing sequence. There¬ 
fore, there exists T > 1 such that Ct < 1 “ Vf < T where p' € (0,1]. Then Lemma 10 together 
with Lemma 17 yield the following, for any t <T, 


E., 


Ct-i-i 


Ct 


u 


> l + E 


St 




U 


> 1 + 


1-Ct 




It follows that 


E[C,+.|(/] > (i + ^)c. 

Taking expectation of both sides, we obtain the following 


lE[Ct+i]> (l + ^)E[Ct] 


(13) 


(14) 


(15) 


Therefore after Ki > {d?/p' + 1) log iterations of GROUSE we have 


e[CkJ>(i + ^) 'e[Co]> 


1 + 


,, il+i\ 

d? 


E[Co] > E[Co]e'°® ^[Col = 1 - 


Therefore, 


f ’’i E[l-Cir,l 


(16) 


where follows by applying Markov inequality to the nonnegative random variable 1 — Cki- If 
T < Ki, then (16) automatically holds. Therefore, together with the following derived from Lemma 

9, we obtain log ( = log = f^odlog{n) where po = 1 + '°s c +diogie/d) _ 


dlogn 


Proof of Theorem 6 Proof This proof follows the same reasoning as the proof of [Theorem 1, 
Richtarik and Takac (2014)]. 

Given Lemma 10 we have 


E[et+i\U] <et-E ^ 


U < et-E 


11(1-[/[/’ 


, 1^2 / cos^ (bt d 
U \ <[l - ^ I et 
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where holds since UU'^ and I — UU'^ are orthogonal complement of each other; and follows 
from Lemma 17. Also note that Lemma 10 implies that cos^ > Ct > Ck > 1/2, Vf > k, 
therefore 

IE [ei+i|[/] < e* (1 — l/2d) . 

Taking expectations of both sides and considering et at t = k + K 2 with K 2 > 2(ilog (1/e*/?), we 
have 


IE[efc+A'2] ^ ^k 



1?3 

< E [efc] 


^ \ 2dlog(E[efc]/e*p) 


< IE[efc]e"^°s(Eh;=]/^*p) 


P 


where ^3 follows by Lemma 16, i.e., given Cfc > 1/2 we have e^ < 1. Again using the Markov 
inequality for the nonnegative random variable e* we have 

nek+K2 < e*) = 1 - n^k+K2 > e*) > 1 - > 1 - p . 


B.2. Noisy Case 

Before we prove the main results (Theorem 7,8) of the noisy case, we first call out the following 
lemmas that will be used frequently throughout the proofs in this section. The proofs are provided 
in Appendix D. 


Lemma 18 Given Condition 1 is satisfied, we have 


E [||?±|P|?^] < (1 — d/''T‘)(^'^\\v\\‘^ 



E [Ikf |n] 


E[||e||||>] < 


d 2 || ||2 

—a n 
n 


< E [||n_L|P|n] + 



a 


V 


2 


Lemma 19 Letting Ai = ^U'^r and A 2 = — UU'^){v\\ — ^x). we have 

E[Ai|[/]=0 E[A2|t/]=0 


Proof The proof can be found in Appendix D. 


Now we are ready to prove Theorem 7 and Theorem 8. 
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Proof of Theorem 7 Proof Given Lemma 11 and that 1 — a = , we have 


E 


( 1 -a)^ 


U 


= E 


ii^±ini^±ip 




u 

> E^ 




E(||r||2|n) E(||p||2|n) 


U 


'd2 

> 


l + fa 2 


1?3 

> 


1 


1 + 

n 


i?4 

> 


n 
1 


E„ 


E 


ll^xIlVlIH 


II^±IP 


|| nj _|| 2 /|| n ||2 + (1 — ( i / n ) iT 2 ||^;||2 


U 




1-0 


u 


1 - 


1 - 


a 


E 




+ 0-2 


a 


(l + ^cj2) d V (1 - Ci)/c^ + 0-^ 


where follows by the fact that and are independent of each other, and for each iteration t 
given U and v, f y and v_\_ are fixed, then applying Jensen’s Inequality yields the results; follows 
from Lemma 18; again we apply Jensen’s Inequality in terms of ll^±llVll^f for ; and ^^4 follows 
from Lemma 17. ■ 


Proof of Theorem 8 Proof According to Lemma 12 we have the following. 


■di 


E [et - ei+i|i7] > E, 


= E„ 


^3 

> E,. 


E 


E 


\\{I-UU^)pr \\{I-UU^){i-ar) 


v,U 


IIpP ipp 

11(1 - p - 11(7 - UU^)ia - ar)p + 2(1 - a)Ai 


v,U 


||(/-[7[7>yp-E[||ex-arp|n,C/] 


1?4 1 

> - 


1 + ^Cj 2 
n 


E„ 


E[\\pf\v,U] 

||(/-i7i7>yp lln^ll^ 


(1 — djn)a‘^ 


PP |P±P/|PP + (1 — (i/n)cT2 


U 


^5 1 

> 




E,; 


\\{I-UU^)vii 


U 


-E 


IP±P 


1 

> 


1 + -cj 2 V d 


cos^ (l)d 


d 


et - [I--](T 


IpII 

f^t/d 


U 


(1 — dln)a^ 


E [|p±P/|p||2|C/] + (1 -d/n)a2 


n) et! d + [1 — dln)(T‘^ 


where di holds since 1 — ||C/7/'^pp/||pp = ||(/ — f7f7^)pp/||pp and \\p + (1 — a)rp = 
IIpP + (1 — Q:)^||rp > Ipp; d 2 follows by 

(1 - a)Ai = ef (I' - UU^) Pll - + ar) = (1 - a)ef (/ - UU^){v\i - 

ds follows by Lemma 19 and the fact given v, p and ^_l — or are independent, then applying Jensen’s 
inequality; and d^ holds due to Lemma 18 and the following: 


a = i — 


, lloiP l|{xll" + 2{It.x 

i ,, ,, ,,2 


19 
































































Zhang Balzano 


which implies 


■p_L - arp 

71 TJ 

— F 

P_Lp-a(2^]^r-a||rp) 

71 TJ 

— F 

(M 

H 

W" 

1 

1 — 1 

71 TJ 

[ PIP 



1 - 

to 



pp 



“ II^P E[\\r\\‘^\v,U] 

Il'f^xlP (1 — d/n)a^\\v\\'^ 
II^P Wv^W^ + {1 — d/n)(j‘^\\v 

where in dj we used Jensen’s Inequality in terms of 

Finally, for we again apply Jensen’s Inequality and then use Lemma 17 for iJg to complete 
the proof. ■ 


Appendix C. Proof of Supporting Theory 


Proof of Lemma 12 Proof We first rotate Ut and Ut+i via an orthogonal matrix defined as 


Wt := 



where Z is an d x (d — 1) mafrix wifh orfhonormal columns whose columns span N{w'^), where 
N{-) denotes fhe nullspace and so fhis gives fhe orfhogonal complemenf of w. Note fhaf R{Ut) is 
unchanged by mulfiplying Ut wifh Wt, i.e, R{Ut) = R{UtW). This allows us fo equivalenfly write 
fhe updafe equafion as follows, 


Ut+iW 


UtW + 


'_y _ p_ 

w'^ 

„ ^^W-UtW + 

'_y _ 

[pll PIlJ 

IP 

[pll IPIlJ 


[1 0 0 ... 0 ] 


Since Frobenius norm is invarianf under orfhogonal fransformalion, if direcfly implies fhe following: 

'lUU^yf WUU^pP 


et - et+i = \\U^Ut+tW\\p - \\U^UtW\\p = 


lldll 


IIpII 


(17) 


Nofe fhaf I — UU^ is fhe orfhogonal complemenf of UU'^, and (I — UU'^)v = 0, hence we have 
fhe following. 


WUU^yf \\um{v + ^-ar)f ||(/_ _ ^^)||2 

lIdP |p + (^ —||n + ^ —arP 


Remark 20 (Proof of the Second Claim in Lemma 10) Note that when there is no noise, y = v £ 
R{U) and p = v\\, therefore (17) implies the following 

WUU^yf im^pr WUU^vwf 

lIdP IIpP PiiP ^ ^ 

this completes the proof of the second claim of Lemma 10. 
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Proof of Lemma 11 Proof Since det (^4 + ab'^) = det(A) (l + A ^o), we obtain 
det U^Ut+i = det (U^Ut + ^ 


= det{U^Ut) • 1 + 


'll bli / 11^^ 
w^{U^U)-W^y W^iu^uyw^p 


||p||||p + (1 - a)r| 


IIpII 


(19) 


Note that 


uF{U^U)-^p = uF {U^Uy^U^Uw = ||n;f 

wl {u'^uyy = s'^u'^uiu'^uyy = v'^r = ||T;xf + v±y 

together with the fact \\p + (1 — a)r|p = ||p|p + (1 — a)^||r|p yields 


E 

Ct+i 

u 


Ct 



= E 


= E 


1?2 

> E 


^ E 


/ IIpIP + (1 - g) (||r;x|P + v'jy + wyu'^Ut) ^U^r) 

\ ||p+(1 - a)r|| IIpII 

/ IIpIP + (1 - a)^ ||r|p + (1 - a)A 2 
\ ||p+(l-a)r||||p|| 

H2 + (l-a)2||rf A2 

+ 2(1 - U 


U 


U 


\\P\\ 




U 


( 20 ) 


where -di follows by ||fx|p = (1 — a)||r|p and A 2 = + re^(C/^C/f)“^C/'^r; 1)2 follows 

from p ± r, this implies \\p + (1 — Q;)r|p = ||p|p + (1 — a)^||r|p; and 7)3 holds since ac¬ 
cording to Lemma 19 and our assumptions that ||u|p < 00 this implies that ||ux|P and with the 
Gaussian noise both ||p|p and ||r|p are bounded with probability equals 1. Therefore, we have 
E[2(l-a)Ai/||p||2|[/] =0. ■ 


Remark 21 (Proof of the First Claim in Lemma 10) For the noise-free case, y = v and w = 

U^V Ust, w'^iu^uyw'^y = s'^u'^uiu'^uyw^v = ||r;||2. Hence (19) yields 

Ct+i IblP , , ll^xlP 

C* IbllP Ill'll P 

this completes the proof of the first claim of Lemma 10. 


Appendix D. 


Proof of Lemma 15 Proof This proof is identical to that of [Lemma 2.12, Balzano and Wright 
(2014)], hut we note that their assumption that x he Gaussian is not necessary. 


E 


1 

^ f- 

1_ 

II 

M 

X<iX j 

IItIIt 

d 

Qij + E 

\ 1 
Ibll? 

L - - j 


L 11*^112 J 

i=l 

L 11*^ II 2 J 


Qn = ^E 
i=l 


4mq) 
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where follows from the fact that E 



0 and ^2 follows by 



d 

™2 


xj 


2=1 


LLILJ 

[Eti^U 


since each Xi is identically distributed. 


i = 1,... ,d 


Proof of Lemma 16 Proof The proof of the first claim is similar to that of [Lemma 16, 
De Sa et al. (2014)]; we briefly sketch it here. Let fi{X) = d — 1 — Yli=i then 

= — 1 + Hj^iXi < 0. That is, fi{X) is a decreasing function of each component. Hence 


/i(X) >/(!) = 0 


For the second claim, let f 2 {X) 
have 


df2iX) 


dX, 


2 (1 - ntiX,) -d + Y.Li Xi, then given H^^Xi > 1/2, we 


2n,yiXi + 1 < 0 ^ /2(X) > /2(1) = 0 


Proof of Lemma 18 Proof Note that ||^|p, ||^±|P and ||C|||P are all distributed random vari¬ 
ables with degrees n,n — d and d. This implies the first two parts. 

And for the last two inequalities, we have 


E 





= E^ [E [^'^UU^v\v]] = 0 
= E^ [^^{I -UU^)v\v] = 0 


which implies the following: 


E [||r|p|r;] = E [||ti_i_|p|n] + E [||^_i_|p|n] < E [||n_L|p|n] + 

E [Ilf^lPI'w] = E [||f|||p|f] +E [ll^iilPlu] < E [||f|||p|f] + -(T^||n|p < + 

71 \ 



Proof of Lemma 19 Proof 

E[Ai|C/] =E„ [E [v^{I -UU^)C±\U,v]] + E^ [E [C^U{U^U)-^U^v±\U,v]] 
+ E - UU^)^\U] 

= tr (E [C^U{U^U)-W^{I - UU^)C\U]) 

= tr {{U^uyw^il - UU^)E U) 
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where holds sinee the following, 

¥.[v'^{I -UU^)^i_\U,v]] =W.[v^ {I - UU^)i\U,v] =0 

¥.[i^U{U^U)-^U^Vi\U,v\ =K[(^U{U^U)-W^{I-UU^)\U,v\ =0 (21) 

and §2 holds sinee E = In- 

The seeond argument following hy similar argument 

E[A 2 |C/] = E^ [E [(^U{I - i7c7^)u|||c/,u]] +tr ((/ - C7c7^)(/ - UU^)K [/) = 0 
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